Abstract

We address the problem of recognizing a place depicted in a query image
by using a large database of geo-tagged images at a city scale. In
particular, we discover features that are useful for recognizing a place
in a data-driven manner, and use this knowledge to predict useful
features in a query image prior to the geo-localization process. This
allows us to achieve better performance while reducing the number of
features. Also, for both learning to predict features and retrieving
geo-tagged images from the database, we propose per-bundle vector of
locally aggregated descriptors (PBVLAD), where each maximally stable
region is described by a vector of locally aggregated descriptors (VLAD)
on multiple scale-invariant features detected within the region.
Experimental results show the proposed approach achieves a significant
improvement over other baseline methods.


1.  Introduction 

Image geo-localization is the process of determining the position of the
capturing viewpoint w.r.t. a geographic reference [43]. The recent
availability of large-scale geo-tagged image collections enables the use
of image retrieval frameworks to transfer geo-tag data from a reference
dataset to an input query image. Applications of these capabilities
include adding and refining geotags in image collections [15, 41],
navigation [26], photo editing [44], and 3D reconstruction [11]. However,
geo-localization of an image is a challenging task because the query
image and the reference images in the database vary significantly due to
changes in scale, illumination, viewpoint, and occlusion.

Image retrieval techniques based on local image features [27] can achieve
increased robustness against photometric and geometric changes [24, 42].
However, not all local features are useful for geo-localization [2  ].
For example, features extracted from transient scene elements
(pedestrians, cars, billboards) and ubiquitous objects (trees, fences,
signage) can introduce obfuscating cues into the geo-localization
process. Many approaches have been proposed to address this issue by
focusing on the uniqueness of a feature, removing or reweighting
non-unique features within the reference data [21, 32] or in the query
image [2]. Unique features are indeed helpful, but a non-unique feature
may also increase the chance of correct localization, either by itself or
in combination with others.

We exploit a data-driven notion of good features for geo-localization.
That is, we aim to favor features that have relatively high matching
scores in correct localization outcomes, in contrast to their relatively
low scores in negative outcomes. Further, we cast feature score
prediction as a classification problem, assuming that the characteristics
of good features are shared within a reasonably scaled geographic region.
We use a separate set of geo-tagged Internet images to generate training
data, computing matches against database images. To cope with noise and
high intra-class variation among the training data, we adopt recent
bottom-up clustering techniques for visual element discovery [8, 9] that
involve iterative training of linear support vector machines (SVMs). At
the query phase, the algorithm selects features in a query image prior to
the geo-localization process by accumulating predictions from the bank of
linear SVMs. Our results show that improved performance is achieved by
using only the features that are predicted as useful, while significantly
reducing the number of features.

The feature representation for such a task should not only be robust to
photometric and geometric changes, but also have high discriminative
power, as we want to learn features over a large area, e.g. a city.
Therefore, we avoid using low-level features for learning, since they are
rarely discriminative over a large area. We propose a per-bundle vector
of locally aggregated descriptors (PBVLAD) for feature representation,
where each maximally stable extremal region (MSER) [28] is described with
a vector of locally aggregated descriptors (VLAD) on multiple
scale-invariant features detected within the region. This allows us to
represent multiple features with a fixed-size vector that can be used in
various classification methods such as an SVM. We show in the experiments
that this feature representation yields a significant improvement over
low-level features in both learning to predict features and retrieving
images.
Figure 1: Overview of our approach. From an input query image with
unknown geo-location (a), MSER regions and SIFT keypoints form bundled
features [40], which are then represented by PBVLAD (b). Features go
through a pre-trained bank of SVMs that outputs binary predictions about
a feature being "good" for geo-localization (c). Predictions are
accumulated to compute a confidence score for each feature (d, left).
Features with high scores are selected for geo-localization (d, right).
The retrieved geo-tagged image is shown in (e).


Our contribution is two-fold: (1) We offer a way to predict features that
are good in a data-driven sense for geo-localization in a reasonably
scaled geographic region. We show that by selecting features based on
predictions from learned classifiers, geo-localization performance can be
improved. (2) We propose per-bundle vector of locally aggregated
descriptors (PBVLAD) as a novel representation for bundled local features
that is effective for both learning to predict features and image
retrieval.

2.  Related  work 

There are two main categories of image geo-localization methods for
street-level input images. Our method falls into the category of
image-retrieval-based methods, where the geolocation of the image is
approximated by identifying geo-tagged reference images depicting the
same place [3, 5, 16, 35]. The other category estimates the full camera
pose of the query image using a 3D structure-from-motion model
constructed from reference images [13, 17, 25, 3  ], which is limited to
places with a dense distribution of reference images.

Our work is most closely related to recent works attempting to select
features that are geographically discriminative by taking advantage of
geotags in the database. Schindler et al. [3  ] build a vocabulary tree
using only unique features that appear at each location. Arandjelovic and
Zisserman [2] use the distribution in the descriptor space as a measure
of distinctiveness. Knopp et al. [21] refine the database by removing
features that match to faraway places. Rather than finding features
unique to specific places, Doersch et al. ['  ] find image patches that
occur frequently in a geographical region and are unique with respect to
other geographic regions. While these methods focus on the uniqueness of
a feature, we focus on features that explicitly contribute to
geo-localization, either positively or negatively. Although unique
features do characterize a location, it may be risky to discard all
non-unique features, some of which may contribute to correct retrieval by
having a higher matching score in the correct location than in false
positives.

Some works cast the localization problem as a classification problem
where visual words are weighted according to their importance to specific
locations [4, 12]. In contrast, we train classifiers to predict whether a
feature is useful for geo-localization over a larger geographic region,
utilizing a separate set of geo-tagged images taken in a city and
collected from photo sharing websites to generate our training data.
Based on the predictions, we select features prior to geo-localization.
We show that better performance is achieved without using all features.
Our approach is also more scalable, as the training images can be much
sparser than the reference images, under the assumption that these
characteristics are shared among images in the same geographic region.

In the field of image retrieval, there is a large body of literature on
feature selection and weighting [30, 36, 38, 45]. The closest work to
ours is [33], which tries to find the importance of each feature by
training a per-exemplar SVM on a given query image with hard negative
mining. While this method can be effective, it is time consuming, as a
fresh model is trained for every query. In contrast, we refine and
organize the outcomes of geo-localizing training images offline, and use
this knowledge for selecting features.

In terms of selecting features in advance of matching in a data-driven
way, our work is closely related to [14], but with a different focus.
Whereas [14] tries to predict features that are likely to form a match,
we predict features that contribute to correct geo-localization. As we
show in our experiments, not all matches are useful for geo-localization.

Previous work applying VLAD to local regions was based either on tiles
from rectangular grids [  ] (as in spatial pyramids [23]) or on bounding
boxes [31  ], neither of which is robust to geometric changes. We propose
to use VLAD for representing a bundled feature [40], which consists of
SIFT keypoints and an MSER region that are both repeatable, making our
PBVLAD robust to geometric and photometric changes.

3.  Proposed  approach 

The overview of our approach is shown in Figure 1. In this section, we
first introduce our proposed feature representation for image retrieval
and training classifiers (Sec. 3.1). We then describe our training
framework for automatically generating training data and training a bank
of SVMs for predicting good features for geo-localization (Sec. 3.2).

3.1.  Per-bundle  VLAD  for  feature  representation 

We want to identify parts of an image that are useful for
geo-localization, using a discriminative classification method such as an
SVM. However, it is hard to learn such characteristics from a low-level
description of a corner or a blob. Thus, we propose the per-bundle vector
of locally aggregated descriptors, namely PBVLAD. The key idea is to use
groups of low-level features and describe them in a fixed-size vector
that can be compared using standard distance measures and used with
various classification methods.

The concept of a bundled feature was proposed by Wu et al. [40] for
retrieving partial-duplicate images. By bundling multiple SIFT features
detected in the same MSER region, the discriminative power is increased
while the feature remains repeatable, as both components are robust to
photometric and geometric changes. The original representation was a
concatenation of quantized SIFT features, which varies in length because
an MSER region can contain different numbers of SIFT features. The
similarity between two bundled features was measured by computing the
intersection between them. In this paper, we propose to describe a
bundled feature with a vector of locally aggregated descriptors (VLAD)
[19]. This representation produces sparse, fixed-size vectors that are
convenient for comparing distances and training classifiers such as an
SVM. Compared to the bag-of-words (BoW) representation, VLAD can have a
much smaller dimension while maintaining high discriminative power, and
it can be further quantized without significant loss in performance. Note
that min-hash sketches can also provide a compact representation [6], but
they have comparably low recall and support only a limited number of
classification methods, as standard distance measures cannot be applied.

Let $R$ and $S$ denote the MSER regions and SIFT features detected in
image $I$, respectively. Each MSER region $r \in R$ contains a set of
SIFT features $B \subset S$ that are detected within that region,
$B = \{ s = (d, l) \mid l \in r \}$, where $d$ and $l$ denote the
descriptor and the location of the SIFT feature. $B$ is called a bundled
feature [40]. For a bundled feature $B_a$, its associated SIFT features
$s_a = (d_a, l_a) \in B_a$ are each assigned to a visual word of a coarse
vocabulary $W$ via nearest neighbor search such that
$NN(d_a) = \arg\min_w \| d_a - c_w \|$, where $c_w$ is the centroid of
the visual word $w$. The subvector of per-bundle VLAD that corresponds to
the visual word $w$, denoted as $p_a^w$, is obtained as an accumulation
of differences between the descriptors $d_i$ that are assigned to $w$ and
the centroid $c_w$. As proposed in [7], we normalize the differences
(i.e., residuals), so that every SIFT descriptor $d_i$ contributes
equally to the vector $p_a^w$. This limits the effect of possible noise,
although bundled features are robust to photometric and geometric
changes.

$$p_a^w = \sum_{d_i : NN(d_i) = w,\ d_i \in B_a} \frac{d_i - c_w}{\| d_i - c_w \|} \tag{1}$$

The final representation is the concatenation of the vectors $p_a^w$
followed by L2 normalization.

$$p_a = [\, p_a^1, p_a^2, \ldots, p_a^{|W|} \,] \tag{2}$$

We tested multiple normalization schemes [1, 19], but the combination of
residual and L2 normalization performed the best on our data. The PBVLAD
representations of corresponding bundled features are visualized in
Figure 2.
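The construction in Eqs. 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pbvlad`, the plain-list data layout, and the toy two-dimensional vocabulary are assumptions made for illustration only.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pbvlad(descriptors, centroids, eps=1e-12):
    """Per-bundle VLAD of one bundle (a sketch of Eqs. 1 and 2).

    descriptors: descriptors of the SIFT features inside one MSER region.
    centroids:   centroids c_w of the coarse vocabulary W.
    Returns a list of length len(centroids) * d with unit L2 norm
    (or all zeros if the bundle is empty).
    """
    dim = len(centroids[0])
    subvectors = [[0.0] * dim for _ in centroids]
    for desc in descriptors:
        # NN(d_i) = argmin_w ||d_i - c_w||
        w = min(range(len(centroids)), key=lambda j: _dist(desc, centroids[j]))
        residual = [x - c for x, c in zip(desc, centroids[w])]
        norm = math.sqrt(sum(r * r for r in residual))
        if norm > eps:
            # Residual normalization: each descriptor adds a unit vector.
            for i, r in enumerate(residual):
                subvectors[w][i] += r / norm
    # Concatenate the per-word subvectors, then L2-normalize the result.
    vec = [v for sub in subvectors for v in sub]
    total = math.sqrt(sum(v * v for v in vec))
    return [v / total for v in vec] if total > eps else vec


# Toy example: two visual words, three descriptors in one bundle.
centroids = [[0.0, 0.0], [10.0, 0.0]]
bundle = [[1.0, 0.0], [0.0, 2.0], [10.0, 3.0]]
p = pbvlad(bundle, centroids)
```

In the toy example the first two descriptors fall on the first visual word and the third on the second, so the unnormalized vector is [1, 1, 0, 1] before the final L2 normalization.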
Similarity metrics. The similarity between two PBVLADs is computed as
their dot product $M(p_a, p_b) = p_a \cdot p_b$. Figure 3 depicts the
matched feature regions of two corresponding images. We define the
matching score $f$ of a feature $p_q$ in a query image $I_q$ to a
reference image $I_r$ as the maximum similarity between $p_q$ and the
features in $I_r$. The image similarity $Sim$ between a query image $I_q$
and the reference image $I_r$ is then the sum of the matching scores of
the individual features $p_q \in I_q$ with respect to $I_r$.

$$f(p_q, I_r) = \max_{p_r \in I_r} M(p_q, p_r) \tag{3}$$

$$Sim(I_q, I_r) = \sum_{p_q \in I_q} f(p_q, I_r) \tag{4}$$

We use the above image similarity measure to retrieve the reference
images that best match the query image.
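As a brute-force illustration of Eqs. 3 and 4, the sketch below scores a query against a small reference set. The function names, the dictionary layout of the reference set, and the score of 0 for an image without features are assumptions for illustration. An indexed nearest-neighbor search would replace the exhaustive loops in practice.

```python
def match_score(p_q, ref_features):
    """f(p_q, I_r): best dot-product similarity of p_q to any feature of I_r.

    Returns 0.0 for a reference image without features (an assumption).
    """
    return max(
        (sum(a * b for a, b in zip(p_q, p_r)) for p_r in ref_features),
        default=0.0,
    )


def image_similarity(query_features, ref_features):
    """Sim(I_q, I_r): sum of the matching scores of all query features."""
    return sum(match_score(p_q, ref_features) for p_q in query_features)


def retrieve(query_features, reference_set, top_n=1):
    """Rank reference images (dict: image id -> feature list) by similarity."""
    ranked = sorted(
        reference_set,
        key=lambda rid: image_similarity(query_features, reference_set[rid]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy example with unit-length 2-D "features".
query = [[1.0, 0.0], [0.0, 1.0]]
references = {
    "A": [[1.0, 0.0], [0.6, 0.8]],
    "B": [[0.0, 1.0]],
}
best = retrieve(query, references, top_n=2)
```

Image "A" scores 1.0 + 0.8 and image "B" scores 0.0 + 1.0, so "A" is ranked first.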

For efficient nearest neighbor search in the reference data, we reduce
the dimension of the raw PBVLAD using principal component analysis (PCA).
Instead of performing PCA on the whole vector, we apply it on a
per-visual-word basis, i.e., on the subvectors $p^w$ generated from each
visual word $w$. We do this in order to preserve the characteristics of
each visual word, which might otherwise be lost due to the overall
sparsity of the vector. In our implementation, a coarse vocabulary of 128
visual words was used, yielding 16,384-dimensional raw PBVLADs. The
dimension is then reduced to 2,048 by performing PCA on each of the 128
visual words and taking the top 16 components of each. Note that PBVLAD
matching can be efficiently indexed using product quantization [18].
Henceforth, the term feature will refer to the PBVLAD representation of a
bundled feature.
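The dimensions quoted above follow from simple arithmetic, assuming the standard 128-dimensional SIFT descriptor (the descriptor dimension is not restated in this section, but it is consistent with the reported 16,384):

$$D_{\text{raw}} = |W| \times d_{\text{SIFT}} = 128 \times 128 = 16{,}384, \qquad D_{\text{reduced}} = |W| \times 16 = 128 \times 16 = 2{,}048$$

$$\frac{D_{\text{reduced}}}{D_{\text{raw}}} = \frac{2{,}048}{16{,}384} = 12.5\%$$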

3.2.  Predicting  good  features  for  geo-localization 

Figure 2: PBVLAD representation of corresponding bundled features. (a,e)
Two different images depicting the same place. (b,d) Multiple SIFT
features are bundled within MSER regions. (c) Each bundle is represented
with VLAD. We follow the visualization scheme of [19], where subvectors
are represented in a 4x4 spatial grid with red representing negative
values. Note that, due to space limits, only non-sparse blocks that
correspond to overlapping visual words of the two bundles are visualized.

Figure 3: Matching with PBVLAD with similarity threshold 0.5.

Automatic training data generation. Given an arbitrary set of geo-tagged
images $X_t = \{I_t\}$, we want to automatically generate good/bad
training examples of features for geo-localization using only their
associated GPS locations. Rather than making assumptions about good and
bad features for geo-localization, we want to find them in a data-driven
way. This enables our method to adapt to various geographical regions.
For each image in the training set, we retrieve the top $n = 100$ images
from the reference set $X_r = \{I_r\}$ using the image similarity defined
in Eq. 4. We investigate whether a feature in a training image,
$p_t \in I_t$, explicitly contributes to the correct retrieval of the
ground-truth image. To this end, we compare the feature's matching score
to a ground-truth reference image, $f(p_t, I_{GT})$, against its matching
score to a falsely retrieved image, $f(p_t, I_{FP})$. Given that the
overall image similarity between two images is the sum of individual
matching scores (Eq. 4), this comparison helps us differentiate good
features based on their individual contribution. If the difference
between the two values, $|f(p_t, I_{GT}) - f(p_t, I_{FP})|$, is greater
than a certain threshold, we include the feature in the training set,
assigning a positive label when $f(p_t, I_{GT}) > f(p_t, I_{FP})$ and a
negative label otherwise. This process is depicted in Figure 5(a-d) and
provides the initial positive and negative training feature set for
data-driven visual component discovery.
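The labeling rule can be sketched as follows. The function names and the threshold value of 0.2 are hypothetical, since the text only refers to "a certain threshold". The sketch also assumes that one ground-truth score and one false-positive score have already been computed per feature.

```python
def label_feature(score_gt, score_fp, threshold):
    """Return +1, -1, or None for one training feature.

    score_gt: matching score f(p_t, I_GT) to a ground-truth reference image.
    score_fp: matching score f(p_t, I_FP) to a falsely retrieved image.
    """
    if abs(score_gt - score_fp) <= threshold:
        return None  # margin too small: the feature is left out
    return 1 if score_gt > score_fp else -1


def build_training_set(scored_features, threshold=0.2):
    """Split (feature_id, score_gt, score_fp) tuples into positives/negatives.

    The default threshold is a placeholder, not a value from the paper.
    """
    positives, negatives = [], []
    for feature_id, score_gt, score_fp in scored_features:
        label = label_feature(score_gt, score_fp, threshold)
        if label == 1:
            positives.append(feature_id)
        elif label == -1:
            negatives.append(feature_id)
    return positives, negatives


scored = [("window", 0.9, 0.3), ("tree", 0.3, 0.9), ("edge", 0.5, 0.45)]
positives, negatives = build_training_set(scored)
```

Here "window" becomes a positive example, "tree" a negative one, and "edge" is dropped because its two scores are too close to be informative.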
Closed-loop training of SVM classifiers. The automatic labeling approach
above can sometimes generate contradictory labels for features with
similar appearance. This commonly occurs for visual elements that appear
on both transient and static objects. In Figure 4, for example, text on
buses (b) and t-shirts (e) is assigned a negative label, while text on
buildings and store signs (d) belongs to the positive set. A limited
field-of-view overlap between a training image and a ground-truth image
can also lead to such contradictory labelings. Windows on the same
building, for instance, can be assigned different labels depending on
their visibility in the ground-truth reference image $I_{GT}$. Such
contradictory labeling of similar features limits the prediction
accuracy.

Figure 4: Initial training data generation. Positive and negative
training examples are depicted in green and blue, respectively.

Figure 5: Overview of our training framework: (a) training images, (b)
matching, (c) retrieved images, (d) training features, (e) bottom-up
clustering, (f) bank of SVM classifiers. For each training image that has
GPS tags (a), we retrieve the top n images from the reference set (b-c).
Positive labels are assigned to features that have a higher matching
score in the ground-truth reference image than in the falsely retrieved
reference images, with a margin greater than thres. Negative labels are
assigned in a similar manner (d). To handle noise and high intra-class
variation, we use a bottom-up clustering technique, refining the positive
set as well as training SVMs iteratively (e-f).

Figure 6: Top elements in the final clusters with a high ratio of
positive labels. Each half row corresponds to a different cluster.

Figure 7: Final negative set elements aligned according to their initial
clusters. Each half row corresponds to a different cluster.

On the other hand, there exists high intra-class variation in both the
positive and negative classes: windows have different appearances from
text, for example, yet features from both appear in the same class.
Training a single classifier over the entire data may be negatively
affected by such intra-class variation.

To solve the problems of contradictory labelings and intra-class
variation, we perform bottom-up clustering [9] on the initial training
feature set. By doing so, we obtain clusters of training examples whose
appearances and labels are most consistent, as well as a bank of linear
SVM classifiers that are trained within each cluster. Each training
example constructs a cluster by finding its k nearest neighbors in the
training set. Redundant sets whose top-ranked elements overlap with
existing sets are eliminated. If a cluster has a high ratio of negative
labels, the negative examples in that cluster are assigned to the final
negative set $\mathcal{N}$, and the positive ones are discarded. For the
remaining clusters $C_i$, a linear SVM is iteratively trained on the
positive examples in each cluster, using $\mathcal{N}$ as the negative
set for hard negative mining (Figure 5(e-f)). As the SVM uses its
true-positive firings for re-training in the iterative procedure,
clusters are left with features having consistent appearances and labels.
Similar to [9], the clusters and $\mathcal{N}$ are divided into three
sets to avoid overfitting. We only keep the SVM classifiers with an
accuracy greater than 0.8. Finally, we remove redundant classifiers whose
weight vectors have a high cosine similarity with those of other
classifiers, as in [20]. Examples of top elements in $C_i$ are shown in
Figure 6. Figure 7 shows elements in $\mathcal{N}$, which are aligned
according to their initial clusters. Interestingly, although our approach
makes no assumption about which features are useful for geo-localization,
we can observe semantic relationships emerging through the learning
process. Namely, windows, characteristic wall patterns, and letters on
signage are detected as positive elements, while features from trees,
people, car wheels, pavements, and edges are considered negative
elements.
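The cluster seeding step described above can be sketched as follows. This covers only the nearest-neighbor seeding and the routing of mostly negative clusters into the negative set; the iterative SVM training, the three-way data split, and the classifier pruning are omitted. The function name, the values of `k` and `max_neg_ratio`, and the exact-duplicate redundancy check are assumptions for illustration, not values or rules from the paper.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def seed_clusters(features, labels, k=3, max_neg_ratio=0.5):
    """Seed clusters by k nearest neighbors and route them by label ratio.

    features: list of equal-length vectors (e.g. PBVLAD features).
    labels:   list of +1 / -1 labels from the automatic labeling step.
    Returns (clusters, negative_set) as lists of feature indices.
    """
    clusters, seen, negative_set = [], set(), set()
    for seed in features:
        # The k training examples nearest to the seed (the seed included).
        order = sorted(range(len(features)),
                       key=lambda j: _dist(seed, features[j]))
        members = order[:k]
        negatives = [j for j in members if labels[j] < 0]
        if len(negatives) / len(members) > max_neg_ratio:
            # Mostly negative: keep the negatives, discard the positives.
            negative_set.update(negatives)
            continue
        positives = tuple(sorted(j for j in members if labels[j] > 0))
        # Simplified redundancy check: drop exact duplicate clusters only.
        if positives and positives not in seen:
            seen.add(positives)
            clusters.append(list(positives))
    return clusters, sorted(negative_set)


features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = [1, 1, -1, -1, -1, 1]
clusters, negative_set = seed_clusters(features, labels)
```

In this toy example the three points near the origin yield a single positive cluster, while the three points near (10, 10) form mostly negative neighborhoods whose negative members go to the negative set and whose lone positive member is discarded.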

In the querying phase, we feed the query image features into the bank of
linear SVM classifiers. We accumulate the predictions from each
classifier to compute the confidence score of a feature being good for
geo-localization (Figure 10 (b)). Each prediction is weighted by the
discriminativeness [34] of its classifier, which is the ratio of the
number of firings in its cluster $C_i$ to the number of firings in the
entire training set; this compensates for the distribution of visual
elements that each cluster spans. We discard features with a low
confidence score and keep only the remaining features for performing
geo-localization (Figure 10 (c)).
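A minimal sketch of this accumulation is given below, assuming each linear SVM is stored as a (weights, bias, discriminativeness) triple and that a classifier "fires" when its decision value is positive. The triple layout, the function names, and the cut-off `min_conf` are illustrative assumptions.

```python
def confidence(feature, classifiers):
    """Accumulate weighted binary predictions from a bank of linear SVMs.

    classifiers: list of (weights, bias, discriminativeness) triples.
    """
    score = 0.0
    for weights, bias, discriminativeness in classifiers:
        decision = sum(w * x for w, x in zip(weights, feature)) + bias
        if decision > 0:  # the classifier fires on this feature
            score += discriminativeness
    return score


def select_features(features, classifiers, min_conf):
    """Keep features whose confidence reaches min_conf (hypothetical cut-off)."""
    return [f for f in features if confidence(f, classifiers) >= min_conf]


bank = [
    ([1.0, 0.0], -0.5, 0.8),  # fires when the first coordinate exceeds 0.5
    ([0.0, 1.0], -0.5, 0.4),  # fires when the second coordinate exceeds 0.5
]
query_features = [[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
selected = select_features(query_features, bank, min_conf=0.5)
```

The first feature triggers both classifiers, the second only the more discriminative one, and the third none, so only the first two are kept.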

Implementation details. For generating the training set, we define the
$I_{GT}$ image set as the reference images that are within 50 meters of
the given GPS location and pass geometric verification w.r.t. the
training image by fitting a fundamental matrix. For $I_{FP}$, we take
reference images that are retrieved within the top $n$ ($n = 100$) and
are at least 270m away from the given GPS location. This accounts for
both user-provided geo-tag errors and the fact that large, symmetric
buildings are often observable from extended areas. Before comparing
$f(p_t, I_{GT})$ and $f(p_t, I_{FP})$, we normalize the matching scores
by a multiplicative factor to compensate for a non-uniform distribution
of features. For training and predicting, we separate features into three
scale levels based on the size of the MSER, as we observed that the
distribution of positive and negative PBVLAD features varies across
scales. The numbers of SVM classifiers used in the three levels were 35,
150, and 25, respectively.

4.  Experiments 

4.1.  Image  Geo-localization 

Dataset. For the reference image set $I_r$, we collected 27,520
geo-registered Google Street View images covering the Pittsburgh (U.S.)
area. These images contain 8 overlapping perspective views extracted from
the spherical panoramas in two different yaw directions, to capture both
eye-level street views and the higher parts of the buildings in urban
environments. This setting is similar to those used in [9, 12, 37]. The
co-located GPS-tagged training image set $I_t$, from which the positive
and negative training data ($C_i$'s and $\mathcal{N}$) are derived, was
downloaded from Flickr and consists of 850 images that were successfully
registered to the nearest elements in $I_r$ through geometric
verification. The test image set $I_q$ was formed from 145 Internet
collection images from the query set of [42] with manually verified GPS
tags.
Results. We compare the proportion of correctly localized images among a
ranked list of the top n candidates. All of our results are without
post-processing such as geometric re-ranking [29]. We consider an image
to be localized if it is within 35m of the ground-truth location. As a
baseline, we compare with our implementation of [42]. We also compare
with a variant of [42] with SIFT feature selection by pre-trained linear
SVMs, in a procedure similar to our selection of PBVLAD features (SIFT
Select).
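The evaluation criterion can be sketched as a recall-at-n computation. The use of the haversine great-circle distance and a mean Earth radius of 6,371 km are assumptions, as the text does not state how distances between GPS tags are measured. The function names and coordinates are illustrative.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def recall_at_n(ranked_locations, true_locations, n=1, max_dist_m=35.0):
    """Fraction of queries with a top-n candidate within max_dist_m of the truth.

    ranked_locations: per query, a ranked list of (lat, lon) candidates.
    true_locations:   per query, the ground-truth (lat, lon).
    """
    hits = 0
    for candidates, truth in zip(ranked_locations, true_locations):
        if any(haversine_m(*c, *truth) <= max_dist_m for c in candidates[:n]):
            hits += 1
    return hits / len(true_locations)


truth = [(40.4406, -79.9959), (40.4406, -79.9959)]
ranked = [
    [(40.4408, -79.9959), (40.4500, -79.9900)],  # top-1 is about 22 m away
    [(40.4416, -79.9959), (40.4407, -79.9959)],  # top-1 is about 111 m away
]
r1 = recall_at_n(ranked, truth, n=1)
r2 = recall_at_n(ranked, truth, n=2)
```

With these toy coordinates only the first query is localized at n = 1, while both are localized at n = 2.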

Figure 8 depicts how our systems with selected PBVLAD (PBVLAD Select) and
all PBVLAD (PBVLAD All) consistently outperform the baseline methods.
Feature selection is more successful with PBVLAD than with SIFT. The
performance of using selected features is consistently better than using
all features with PBVLAD, whereas this behavior alternates when
considering SIFT features.

The performance at the top of the shortlist (n = 1) is displayed in
Table 1. Our method achieves a recall of 64.83% using all features and
improves to 68.28% with selected features, while the best baseline method
(SIFT All [42]) obtains 49.66%. We also tested the performance of the
system using the same number of PBVLAD features as our selection
framework, but picked randomly (PBVLAD Random). Its poor recall rate
supports the effectiveness of our selection mechanism, illustrating how
simply selecting fewer features does not generally improve the
performance. Moreover, we also tested with the features that are not
selected by our framework (PBVLAD Select^c) to illustrate how discarded
features are in general detrimental to geo-localization. The random
chance of retrieving a correct image is 0.2%, which reflects the
difficulty of the dataset.

Method             % Correct
PBVLAD All         64.83
PBVLAD Select      68.28
PBVLAD Random      33.38
PBVLAD Select^c    19.31
SIFT All [42]      49.66
SIFT Select        46.90
Chance              0.20

Table 1: Proportion of correctly localized images at top 1.

Figure 8: Geo-localization performance (axis label: Fraction of Correctly
Localized Images (%)).

Figure  9  shows  examples  of  our  results  using  PBVLAD. 
The  top  four  retrieved  images  are  shown  for  each  query  im¬ 
age.  As  can  be  seen,  our  method  retrieves  correct  reference 
images  despite  partial  occlusions  and  changes  in  viewpoint, 
illumination,  and  scale.  Figure  10  depicts  other  examples 
where  PBVLAD  Select  outperforms  PBVLAD  All. 

We attribute the enhanced performance of PBVLAD-based retrieval to the
increased discriminative power provided by aggregated features. Figure 11
(b) illustrates the maximum feature similarity score obtained for the
features within a query image (a) w.r.t. the entire reference dataset. We
can observe that PBVLAD features in foliage image regions are not highly
matched to the reference set. Whereas individual SIFT features may have
many similar features in the dataset, the analysis of their local
ensembles is more discriminative. Moreover, our final predicted feature
scores (c) illustrate how our framework discriminates good features prior
to direct feature similarity estimation.

Figure 9: Example results. (left) Query images. (right) Top four
retrieved images using our proposed PBVLAD. Query images are of various
sizes.

Figure 10: Qualitative comparison of retrieved images using selected
PBVLAD features and using all of the features. (a) Query image. (b) Heat
map representation of the confidence of being a good feature. (c)
Selected features (green: selected, blue: discarded). (d) Retrieved image
using selected features. (e) Retrieved image using all features.

Figure 11: (a) Query image. (b) Heat map of the maximum matching scores
$\max(f(p_q, I_r))$ of each feature $p_q$. (c) Confidence scores.

Figure 12: Failure cases (columns: query image, ground truth, retrieved
image). Retrieved images are more than 100m away from ground-truth
locations.

Failure cases. There are many cases where the ranked list contained the
same building as in the query image, but at different locations. The
first and second rows of Figure 12 show such examples. This occurs often
for images depicting large, symmetric buildings. In many cases, the
building itself looked more similar in the retrieved image than in the
ground-truth reference image. Another observation is that under severe
scale changes, the number of SIFT keypoints detected within the MSER
region is reduced due to the lack of details. In such cases, it becomes
hard to match a PBVLAD, as many of its group members are missing. This
could be alleviated by using spectral SIFT [22], or by only including
keypoints detected within some scale range from the MSER region, similar
to [6].

4.2.  PBVLAD  for  general  image  retrieval 

We evaluate PBVLAD as a descriptor for image retrieval on the Oxford5k
Buildings dataset [2'  ]. Table 2 compares our method against
state-of-the-art image retrieval approaches [10, 19], which include VLAD,
Fisher vector (FV), and a bag-of-words baseline. The evaluation was
performed without dimensionality reduction for all methods. PBVLAD shows
performance competitive with other state-of-the-art descriptors. Table 3
shows the effect of dimension reduction using PCA. The decrease in
performance is not significant until the dimension is reduced to 12.5%.


Descriptor     # Vocabulary    mAP
BoW [19]       200,000         0.364
BoW [19]       20,000          0.319
Fisher [19]    64              0.317
VLAD [1(  ]    128             0.339
PBVLAD         128             0.369

Table 2: Comparative image retrieval performance of PBVLAD on the Oxford
5k dataset. The accuracy is measured by the mean Average Precision (mAP).
All descriptors are uncompressed.


       Full Dim    Reduced Dim
Dim    16384       8192     4096     2048     1024
mAP    0.369       0.364    0.334    0.264    0.210

Table 3: Retrieval performance of PBVLAD on the Oxford 5k dataset, before
and after dimensionality reduction using PCA. The accuracy is measured by
the mean Average Precision (mAP).
5.  Conclusion 

In this work, we proposed per-bundle vector of locally aggregated
descriptors (PBVLAD) for maximally stable regions in an image. PBVLAD
provides a convenient and effective representation for classification of
grouped local features. Using this descriptor and a geo-tagged Internet
image collection, good and bad features for geo-localization were
identified, with the notion of good/bad being explicitly defined in terms
of the feature's contribution to the retrieval process. To remove noisy
labels and deal with the large intra-class variation, bottom-up
clustering was performed, generating a bank of SVM classifiers. At the
query phase, the outputs of the classifiers were accumulated to select
good features. The experimental results show an improvement in
geo-localization accuracy when only the good features predicted by our
algorithm are used.